In this paper, we describe a method of execution retry for bypassing software errors based
on  checkpointing,  rollback,  message  reordering  and  replaying.  We  demonstrate  how  rollback 
techniques,  previously  developed  for  transient  hardware  failure  recovery,  can  also  be  used 
to  recover  from  software  faults  by  exploiting  message  reordering  to  bypass  software  errors. 
Our  approach  intentionally  increases  the  degree  of  nondeterminism  and  the  scope  of  rollback 
when  a  previous  retry  fails.  Examples  from  our  experience  with  telecommunications  software 
systems  illustrate  the  benefits  of  the  scheme. 
1  Introduction


Numerous checkpointing and rollback recovery techniques have been proposed in the
literature to recover from transient hardware failures. Independent checkpointing schemes
[1,2] allow maximum process autonomy and general nondeterministic execution, but suffer
from the potential domino effect [3]. Coordinated checkpointing schemes [4,5] eliminate the
domino effect by sacrificing a certain degree of process autonomy and paying the cost of
extra coordination messages. Recently, a lazy checkpoint coordination technique [6] has been
proposed as a mechanism for bounding rollback propagation and providing a flexible trade-off
between run-time coordination overhead and recovery efficiency.

Log-based recovery provides another way of achieving domino-free recovery. Under the
piecewise deterministic model [7], the domino effect is avoided through message logging and
deterministic replaying. In a pessimistic logging protocol [8,9], each message is logged upon
receipt, which prevents the rollback of a faulty process from causing the rollback of any
other process. Optimistic logging protocols [7,10-12] have been proposed to reduce run-time
overhead by using asynchronous message logging at the expense of possible rollback
propagation due to lost volatile message logs upon failure.

Instead  of  proposing  another  checkpointing  and  recovery  protocol,  this  paper  investigates 
the  possibility  of  applying  the  log-based  techniques  to  recovery  from  software  errors  [7,  IS¬ 
IS].  We  previously  proposed  message  reordering  for  changing  the  communication  pattern  at 
run-time  in  order  to  reduce  the  rollback  distance  for  hardware  failures  [17].  In  this  paper,  we 
demonstrate how message reordering can also provide an effective way of bypassing software
errors. Fig. 1 illustrates the basic concept. When a software error is detected at the point
marked “X”, rollback and message replaying based on the complete checkpoint and message
log information may lead to the same error. By intentionally discarding part of the message
logs, we can deterministically reconstruct the system state up to the dotted line shown in
Fig. 1, and then use message reordering to introduce nondeterministic execution beyond the
dotted line in order to bypass the software error. Unlike the recovery block approach [3] and
N-version programming [18], which both use different programs to execute on the same
set of data, the above on-line retry approach [14,19] uses the same program to operate on a
different but consistent set of data obtained through message reordering.


Figure  1:  Nondeterministic  execution  beyond  the  dotted  line  through  message  reordering. 

Based on our experience with telecommunications software systems, the technique of
execution retry with rollback and message replaying has demonstrated its usefulness for
bypassing the so-called software boundary errors. Usually, an application contains a main
routine that performs the designated functions, and some boundary code for handling
situations such as program exceptions, resource failures, urgent or unexpected messages, failures
on system or function calls, etc. In many application programs that we have observed, the
boundary code is usually not well tested due to the difficulty in creating such boundary
conditions in a test environment [15]. Consequently, the possibility of software errors in the
boundary code, called the software boundary errors, can potentially be higher than that in
the main routine [15]. These kinds of software boundary errors may cause a catastrophic
event such as the AT&T 4ESS switching systems failure in January 1990 [20]. The fact that
a software boundary condition usually occurs very rarely also suggests that if a boundary
error does occur, then on-line retry by replaying and/or reordering the incoming messages
may be helpful in bypassing the boundary condition.

The  simplest  approach  to  execution  retry  is  to  roll  back  the  entire  system  and  restart 
from  a  consistent  global  checkpoint.  This  can  result  in  nondeterministic  execution  in  a 
distributed  message-passing  environment  and  this  nondeterminism  may  result  in  bypassing 
the  boundary  condition.  However,  it  is  often  desirable  to  limit  the  scope  of  rollback,  the 
number  of  involved  processes  as  well  as  total  rollback  distance,  in  order  to  achieve  faster 
recovery  [21].  It  is  possible  that  a  small-scope  rollback  involving  only  a  few  processes  suffices 
for  successful  retry.  This  motivates  our  progressive  retry  idea  which  progressively  increases 
the  scope  of  rollback  to  intentionally  introduce  more  nondeterminism  when  a  previous  retry 
fails. Such an idea has been implemented in some telecommunication systems software and
has  been  shown  to  improve  the  availability  of  the  systems.  The  objective  of  this  paper 
is  to  describe  and  formalize  the  concept  of  progressive  retry  with  message  reordering  to 
bypass  software  errors  and  to  present  a  systematic  method  for  implementing  the  retry.  The 
technique  is  being  built  into  an  existing  fault  tolerance  library  [22]  in  order  to  facilitate 
future  software  development. 


2  Logical  Checkpoints  and  Recovery  Line 

Let N be the number of processes in the system. Suppose p1 in Fig. 2 initiates a rollback
at the point marked “X”. In a general nondeterministic execution, the rollback of p1 to its
checkpoint C will unsend messages M1 and M3, and thus require p0 to roll back to a state
before the receipt of M1 in order to unreceive M1 and similarly require p2 to unreceive M3;
otherwise, M1 and M3 are recorded as “received but not yet sent”, which results in the
inconsistency of system state.


Figure 2: State consistency (a) example checkpoint and communication pattern (b) logical
checkpoint dependency and recovery line.

However, if p1 retains enough information and can guarantee to resend M1 during its
reexecution¹, the execution of p0 based on the processing of M1 is still valid and therefore p0
need not roll back. This can be achieved by the piecewise deterministic model and additional
message logging and replaying. The piecewise deterministic model says: process execution
between two consecutive message receipts, called a state interval, is deterministic. So if p1 has
logged both the message content and the state interval index [10] (i.e., the processing order)
for message M0 (but not for M2) by the time it initiates the rollback, p1 can deterministically
reconstruct the state up to immediately before the receipt of M2 (a nondeterministic event)
and therefore M1 remains a valid message.

¹ Under the fail-stop [23] assumption.

A useful way to unify these two seemingly different state consistency concepts is to
introduce the notion of a logical checkpoint. While a physical checkpoint like checkpoint C
allows the restoration of process state at the point the checkpoint was taken, checkpoint
C and the message log of M0, plus the underlying piecewise deterministic model, effectively
place a logical checkpoint at the end of the state interval started by M0 (as shown in Fig. 2(a))
because of the capability of state reconstruction. In other words, although p1 “physically”
rolls back to checkpoint C, it “logically” rolls back to the above logical checkpoint and
therefore does not unsend M1. For clarity, we let each physical checkpoint initiate a new
state interval and represent it by a logical checkpoint at the end of that interval. Based on
the above notion of logical checkpoints, the following rollback propagation rule² is then valid
with or without the piecewise deterministic model:

if  the  sender  rolls  back  and  unsends  a  message  M,  the  receiver  must 
also  roll  back  to  unreceive  M. 

We define a global checkpoint as a set of N (logical) checkpoints, one from each process.
A consistent global checkpoint is a global checkpoint that does not contain any two
checkpoints violating the above rollback propagation rule. The recovery line is the latest available
consistent global checkpoint, which uniquely minimizes the total rollback distance. As an
illustration, suppose all the messages in Fig. 2(a) except for M2 are logged when p1 initiates
the rollback. Fig. 2(b) shows the dependency graph for the available logical checkpoints. By
starting with the set of the last logical checkpoints of each process and applying the rollback
propagation rule described above, we can determine the recovery line to be the set of shaded
checkpoints in Fig. 2(b).
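
The fixed-point computation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the integer checkpoint indices, the tuple encoding of messages, the name recovery_line, and the example pattern are all introduced here for illustration and do not reproduce Fig. 2.

```python
def recovery_line(latest, messages):
    """Roll receivers back until no unsent message is still recorded as received.

    latest:   {process: index of its last available logical checkpoint}
    messages: (sender, s, receiver, r) tuples. The message is sent during the
              sender's state interval s and starts the receiver's interval r
              (r >= 1), and logical checkpoint k covers intervals 0..k.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, s, receiver, r in messages:
            unsent = line[sender] < s        # sender rolled back past the send
            received = line[receiver] >= r   # receiver still reflects the receipt
            if unsent and received:
                line[receiver] = r - 1       # receiver must unreceive the message
                changed = True
    return line


# Hypothetical pattern: p1 can only restore its logical checkpoint 1.
latest = {"p0": 2, "p1": 1, "p2": 3}
messages = [
    ("p1", 1, "p0", 2),  # still sent at p1's checkpoint 1, forces nothing
    ("p1", 3, "p2", 2),  # unsent by p1, so p2 must unreceive it
    ("p2", 2, "p0", 2),  # p2's rollback unsends this one, so p0 rolls back too
]
print(recovery_line(latest, messages))
```

Starting from the last available logical checkpoint of every process and only ever moving a receiver backward mirrors the propagation rule: a sender's rollback can force a receiver's rollback, never the reverse.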

² In contrast, when the receiver rolls back and unreceives a message M', the sender does not have to roll
back if M' is logged and can be retrieved by the receiver during reexecution.
3  Progressive Retry for Bypassing Software Errors


We  base  our  discussions  on  the  following  system  model  and  recovery  protocol. 

FIFO channel: messages sent along the same channel between any two processes are
ordered by monotonically increasing sequence numbers.

Nondeterministic merge component [7]: messages from all incoming channels are merged
by the merge component based on a possibly nondeterministic merge function, and are
assigned the state interval indices.

Logging before processing: every message is logged before delivery to the application
process³.

Direct dependency tracking [10,24,25]: only the dependency of the receiver’s logical
checkpoint on the sender’s logical checkpoint resulting from each message processing
is recorded, as opposed to the transitive dependency tracking which has been used in
many log-based papers [7,11].

Centralized recovery line computation: the global dependency information is
collected by a single process [2,10] which is responsible for the recovery line computation⁴.
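
To make the interplay of the merge component and the logging-before-processing assumption concrete, the following Python sketch may help. The class name MergeComponent, its fields, and the in-memory list standing in for stable storage are assumptions made for illustration only; the merge order here is simply arrival order.

```python
class MergeComponent:
    """Merges incoming channels, assigns state interval indices, logs first."""

    def __init__(self, deliver):
        self.deliver = deliver    # callback into the application process
        self.log = []             # stand-in for the stable-storage message log
        self.next_interval = 1    # interval 0 is started by the checkpoint

    def accept(self, sender, seq, payload):
        entry = {"sender": sender, "seq": seq, "payload": payload,
                 "interval": self.next_interval}
        self.next_interval += 1
        self.log.append(entry)    # logging before processing
        self.deliver(entry)       # only now does the application see it
        return entry


# Record, at delivery time, the assigned interval and the log length.
seen = []
merge = MergeComponent(deliver=lambda e: seen.append((e["interval"], len(merge.log))))
merge.accept("p1", 1, "a")
merge.accept("p3", 1, "b")
print(seen)
```

Because each entry is appended to the log before the callback runs, every message the application has processed is guaranteed to have both its content and its state interval index available for replay.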

3.1  Recovery  Line  and  Message  Logs 

With respect to the recovery line consisting of the shaded checkpoints shown in Fig. 3,
messages can be classified into four categories.

³ Our work can be extended to systems with asynchronous (optimistic) message logging by making
additional logical checkpoints unavailable for those lost volatile message logs due to the failure.

⁴ A distributed and synchronised algorithm has been proposed by Sistla and Welch [11]. A distributed
and asynchronous algorithm can be found in Strom and Yemini’s paper [7].
1. Obsolete messages: In order to reconstruct the state up to the recovery line, the
system can restart from the set of restarting checkpoints, called the restart line, as
illustrated in Fig. 3. Messages that were processed before the restart line, for example
M0, are therefore obsolete messages and not useful for recovery.

2. Messages for deterministic replay (deterministic messages): Messages
processed between the restart line and the recovery line must have both their message
contents and state interval indices logged. These messages need to be replayed in their
original order for deterministic state reconstruction. Md and M'q are such messages.

3. In-transit (or channel-state) messages: For messages sent before the recovery line
and processed after, only the message contents in the log are valid. The state interval
indices are either not logged or invalidated. Messages like Mi and Mj belong to this
category and can be processed in arbitrary order.

4. Orphan messages: Messages sent after the recovery line are orphan messages. M'^
cannot exist because otherwise the recovery line is not consistent. Mr is invalidated
by the rollback and should be discarded.
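
The four categories above can be expressed as a small classification function. The following is a hedged Python sketch: the interval-index encoding of send and processing positions, and the names Category and classify, are assumptions introduced here rather than part of the described protocol.

```python
from enum import Enum


class Category(Enum):
    OBSOLETE = 1
    DETERMINISTIC = 2
    IN_TRANSIT = 3
    ORPHAN = 4


def classify(msg, restart, recovery):
    """msg = (sender, s, receiver, r): sent in the sender's state interval s
    and processed as the receiver's state interval r. restart[p] and
    recovery[p] are the indices of p's checkpoints on the two lines."""
    sender, s, receiver, r = msg
    if s > recovery[sender]:
        return Category.ORPHAN          # sent after the recovery line
    if r <= restart[receiver]:
        return Category.OBSOLETE        # processed before the restart line
    if r <= recovery[receiver]:
        return Category.DETERMINISTIC   # must be replayed in original order
    return Category.IN_TRANSIT          # content valid, processing order free


restart = {"p": 1, "q": 1}
recovery = {"p": 3, "q": 3}
for msg in [("p", 0, "q", 1), ("p", 2, "q", 2), ("p", 3, "q", 4), ("p", 4, "q", 5)]:
    print(msg, classify(msg, restart, recovery).name)
```

With a consistent recovery line, a message sent after the line cannot have been processed before it, so checking the orphan condition first is safe.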

In an optimistic logging protocol [7], rollback propagation can result from the
nondeterminism due to the lost volatile message logs upon failure. Based on the available message logs
from stable storage, the recovery line is uniquely determined and each message must
statically belong to one of the four categories depending on its position relative to the recovery
line. In contrast, our retry technique progressively increases the degree of nondeterminism
and the scope of rollback by discarding more message logs as a previous retry fails. At each
step, a new recovery line or restart line is computed based on the remaining checkpoint and
message log information. Since the recovery line moves backward in time during the
progressive retry, messages belonging to the ith category can shift to the jth category, where
j > i, at a later stage.

Figure 3: Example obsolete, deterministic, in-transit and orphan messages.

3.2  Progressive  Retry 

We  will  use  the  example  checkpoint  and  communication  pattern  shown  in  Fig.  4  to  illustrate 
progressive  retry  in  five  steps.  We  assume  one  retry  per  step  and  no  hardware  failure  for  the 
following  discussion. 

Step 1 - Receiver deterministic retry: When p2 detects an error, it first initiates a
local recovery by rolling back to checkpoint C and deterministically replaying the
message logs. Because every message is logged before processing, message logs for Ma,
Mb and Mc must be available and allow p2 to reconstruct the state up to the point
it detected the error, as illustrated by the recovery line shown in Fig. 4(a). In some
cases, transient failures may be caused by some environmental factors which will simply
disappear after the recovery, and the Step-1 retry may succeed. If the reexecution still
leads to the same error, the checkpoint and message log information is copied to a
trace file for off-line debugging and Step 2 is initiated.
Figure 4: Progressive retry (a) Step 1: receiver deterministic retry (b) Step 2: receiver
nondeterministic retry (c) Step 3: sender deterministic retry (d) Step 4: sender nondeterministic
retry.
Step 2 - Receiver nondeterministic retry: p2 starts introducing nondeterminism by
discarding the state interval indices of Ma, Mb and Mc in order to allow message
reordering. As a result, the last three logical checkpoints of p2 are now unavailable
and the resulting recovery line is shown in Fig. 4(b). Notice that only Ma and Mb
are in-transit messages available for reordering; message Mc now becomes an orphan
message and should be discarded.

Message reordering can be achieved by restoring the in-transit messages to the input
of the nondeterministic merge function and reassigning them with possibly different state
interval indices. An alternative is to group the messages from the same process together
if the software bug is possibly due to concurrency control.
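
One way to picture the candidate orders available to a Step-2 retry is to enumerate the merges of the per-sender message queues. The Python sketch below assumes, as an illustration only, that a reordering keeps each channel's FIFO order and permutes only the interleaving across senders; the function names are hypothetical.

```python
def fifo_interleavings(queues):
    """Yield every merge of the per-sender queues that keeps each queue's order."""
    queues = [q for q in queues if q]
    if not queues:
        yield []
        return
    for i, q in enumerate(queues):
        # Take the head of queue i next, then merge whatever remains.
        rest = queues[:i] + [q[1:]] + queues[i + 1:]
        for tail in fifo_interleavings(rest):
            yield [q[0]] + tail


def grouped_by_sender(queues):
    """The alternative mentioned above: all messages of one sender together."""
    return [m for q in queues for m in q]


in_transit = [["a1", "a2"], ["b1"]]   # hypothetical messages from two senders
orders = list(fifo_interleavings(in_transit))
print(orders)
print(grouped_by_sender(in_transit))
```

Each yielded order corresponds to one assignment of new state interval indices; a retry would pick an order different from the one that led to the error.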

Step 3 - Sender deterministic retry: Messages that have been received but not yet
logged, i.e., still in the system queue, can be lost upon failure. Message Md in Fig. 4
is an example. Such lost messages can be detected⁵ when the receiver receives another
message from the same sender which indicates a discontinuity in the message sequence
number [7]. The sender is then requested to resend the message if sender logging is
available [7], or to regenerate the message through deterministic state reconstruction
[11].

The immediate recovery of such lost messages is useful for increasing the number of
messages available for reordering. p2 now discards the message contents of Ma and Mb
as well. Although the resulting recovery line as shown in Fig. 4(c) is the same as the
one in (b), p3 in addition to p1 and p2 is rolled back⁶ in order to regenerate (recover) the
lost message Md. Again, if sender logging is available, p3 can simply resend messages
Ma and Md without rolling back.

⁵ For some applications, lost messages may be acceptable. For example, if the lost message is a channel
request message in a telephone switching application, the user will simply redial or try again later.

⁶ p2 can notify p3 to roll back by sending p3 the largest sequence number of any message sent from p3 and
received by p2 before checkpoint C. Similar messages are sent to all other processes.
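
The sequence-number discontinuity check used in Step 3 to detect lost messages can be sketched as follows. This Python fragment is illustrative only: it assumes sequence numbers start at 1 and increase by exactly 1 per message on each channel, which is stricter than the monotonicity the system model requires, and the class name GapDetector is hypothetical.

```python
class GapDetector:
    """Tracks, per sender, the highest sequence number seen by the receiver."""

    def __init__(self):
        self.last_seen = {}

    def observe(self, sender, seq):
        """Return the sequence numbers skipped since the previous message."""
        prev = self.last_seen.get(sender, 0)
        missing = list(range(prev + 1, seq))
        self.last_seen[sender] = max(prev, seq)
        return missing


detector = GapDetector()
print(detector.observe("p3", 1))   # nothing missing
print(detector.observe("p3", 3))   # message 2 from p3 was lost
```

The returned list is what the receiver would ask the sender to resend or regenerate.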


Step 4 - Sender nondeterministic retry: When reordering Ma, Mb, Md and possibly
other newly-arriving non-orphan messages still fails to bypass the software error, p2
suspects some of these messages should not have been generated in the first place.
Therefore, p1 and p3 are requested to roll back further by discarding the state interval
indices of the message logs that can deterministically generate these messages. The
resulting recovery line is given in Fig. 4(d). Nondeterminism can be introduced by p1
reordering M„ and M„„ and p3 reordering M*, My and M*.

Step 5 - Large-scope rollback retry: When all previous small-scope retries fail, a
large-scope rollback can be initiated. Instead of backing off a few state intervals for
reordering a small number of messages involving a small number of processes, all processes in
the system are requested to roll back K intervals where K should be a large number
compared to Step 1 through Step 4. The recovery line computed from the remaining
available logical checkpoints is then used for the final-step retry.

The choice of K is a trade-off between output commit and garbage collection versus
the available nondeterminism. Outputs to the outside world that cannot be rolled
back should only be released after the recovery line has advanced beyond the state
intervals that generate these outputs. Checkpoints and message logs can only be
garbage-collected after the restart line has passed their corresponding state intervals
[11]. Therefore, while a larger K means more nondeterminism is available, it also results
in slower output commit and less effective garbage collection, which are translated into
slower response to the users and larger space overhead, respectively. In the extreme
case where fast output commit is the most important requirement for the system,
only those state intervals beyond the last output can be backed off for introducing
nondeterminism [7].
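
The escalation through the five steps can be summarized as a simple driver loop. The Python sketch below assumes one retry per step, as in the discussion above; the attempt callback, which would perform the rollback and reexecution for a given step and report success, is a hypothetical placeholder.

```python
STEPS = [
    (1, "receiver deterministic retry"),
    (2, "receiver nondeterministic retry"),
    (3, "sender deterministic retry"),
    (4, "sender nondeterministic retry"),
    (5, "large-scope rollback retry"),
]


def progressive_retry(attempt):
    """Try each step in order, widening the rollback scope only on failure.

    attempt(step_number) performs one retry and returns True if the
    reexecution bypassed the error. Returns the successful step, or None
    if even the large-scope rollback failed.
    """
    for number, name in STEPS:
        if attempt(number):
            return number, name
    return None


# Hypothetical error that is only bypassed once senders resend their messages.
print(progressive_retry(lambda step: step >= 3))
```

Keeping the cheap, small-scope steps first reflects the stated goal of limiting the number of involved processes and the total rollback distance.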


4  Experience  and  Discussion 

In this section, we describe two examples from telecommunications software with software
boundary  errors.  These  errors  resulted  in  program  hang-up  or  program  crash.  However,  by 
using  the  progressive  retry  technique  (Step  2  for  Case  1  and  Step  3  for  Case  2),  these  programs 
were  able  to  quickly  recover  from  the  errors.  To  simplify  the  description,  we  have  abstracted 
only  the  components  which  contributed  to  the  errors.  These  software  errors  were  later  found 
and  fixed.  These  examples  show  that  even  before  the  software  faults  were  repaired,  the 
software  errors  did  not  interrupt  the  services,  due  to  progressive  retry. 

Case  1 

In  a  file  replication  mechanism,  all  “open”,  “write”  and  “close”  system  calls  are  trapped 
by  the  primary  node  and  passed  to  an  agent  process  on  a  backup  node.  The  agent  process 
performs the system calls to replicate files. The agent process opens a file when an open
command is received and closes a file when a close command is received. There is only one
agent process to serve many applications on the primary node. Since the number of available
file descriptors for the agent process is limited and each application process could open many
files  at  the  same  time,  the  agent  process  may  run  out  of  file  descriptors.  Therefore,  it  has  to 
keep  track  of  how  many  files  are  open.  A  boundary  condition  for  the  agent  process  occurs 
when  all  file  descriptors  are  used.  The  agent  process  then  searches  for  an  open  file  descriptor 
with  the  earliest  access  time  and  closes  that  file. 

A software bug existed in the search procedure so that once the agent process entered the
boundary condition, the search process never finished and the agent process hung up. The
agent process implemented the checkpointing and logging mechanism and had an external
hang-up detection mechanism. Once the agent process entered the boundary condition, the
failure was detected and the agent process was rolled back. When the agent process was
restarted, it restored the checkpointed state, and reordered and reexecuted the message logs.
Once the messages were reexecuted, the agent program was able to continue its operation.

The following example illustrates how progressive retry functions in this instance. Let
o1 command stand for opening file 1, w1 command stand for writing data to file 1 and c1
command stand for closing file 1. The agent process can open at most 2 files at the same
time. The following command sequence will cause the agent program to enter the boundary
condition when processing o3 and hang up.

o1 o2 w1 w1 w2 w1 w2 o3 w3 c1 c2

Suppose  the  logging  mechanism  had  logged  all  the  commands  before  the  failure.  When  the 
agent  process  is  restarted,  the  command  log  may  be  reordered  with  the  following  sequence: 

o1 w1 w1 w1 c1 o2 w2 w2 c2 o3 w3

In  this  sequence,  the  boundary  condition  never  occurs,  and  therefore  the  reexecution  of  the 
command  log  succeeds. 
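
The two command sequences of Case 1 can be checked with a small simulation. This Python sketch models only the descriptor count, not the agent process itself; the function name hits_boundary and the one-letter command parsing are assumptions made for illustration.

```python
def hits_boundary(commands, max_open=2):
    """Return True if an open arrives while all descriptors are in use."""
    open_files = set()
    for cmd in commands.split():
        op, file_id = cmd[0], cmd[1:]
        if op == "o":
            if len(open_files) == max_open:
                return True            # the buggy search procedure would run here
            open_files.add(file_id)
        elif op == "c":
            open_files.discard(file_id)
        # "w" commands do not change the set of open files
    return False


original = "o1 o2 w1 w1 w2 w1 w2 o3 w3 c1 c2"
reordered = "o1 w1 w1 w1 c1 o2 w2 w2 c2 o3 w3"
print(hits_boundary(original), hits_boundary(reordered))
```

The reordered log keeps the per-file command order intact, which is why replaying it produces the same file contents without ever exhausting the descriptors.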

Case  2 

In  a  cross-connection  system,  a  process  (BK)  is  used  to  track  the  available  channels  in 
the  switch.  The  BK  process  gets  information  from  two  other  processes:  process  (CA)  which 
sends  the  channel  allocation  requests  and  process  (DA)  which  sends  the  channel  deallocation 
requests.  A  boundary  condition  for  BK  occurs  when  all  channels  are  used  and  the  process 
receives  additional  allocation  requests.  In  that  case,  a  clean-up  procedure  is  called  to  free 
up  some  channels  or  to  block  further  requests.  However,  the  clean-up  procedure  contained 
a  software  glitch  which  could  cause  the  process  to  crash. 
The cross-connection system uses a daemon watcher to detect a process failure and
employs a checkpointing and message logging mechanism to recover from the failure. The
following example illustrates how progressive retry works in this system. Suppose the number
of available channels is 5. The command r2 stands for requesting two channels, and the
command f2 stands for freeing two channels. The following command sequence can cause
the BK process to crash because of the boundary error.

CA sends r2 r3 r1
DA sends f2 f3 f1
BK receives r2 r3 r1 and crashes

If the message f2 is received and logged before BK crashes, BK will be able to recover
by reordering the message logs. However, if BK crashes before the f2 message is logged,
reordering messages r2, r3 and r1 (Step 2) will not help. In this case, the local recovery of
BK fails and CA and DA will be requested to resend their messages (Step 3). Because of
the nondeterminism in operating system scheduling and communication delay, the messages
may arrive at BK in a different order. For example, the message order can be

r2 r3 f2 f3 r1 f1

Since the boundary error does not occur in this case, the progressive retry involving three
processes succeeds.
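
The reasoning in Case 2 can be verified with a short simulation. The Python sketch below assumes the boundary condition is triggered whenever a request would exceed the 5 available channels, which is a slight generalization of "all channels are used"; the function name crashes is hypothetical.

```python
from itertools import permutations


def crashes(commands, capacity=5):
    """Return True if some allocation request exceeds the free channels."""
    used = 0
    for cmd in commands:
        op, n = cmd[0], int(cmd[1:])
        if op == "r":
            if used + n > capacity:
                return True        # the faulty clean-up procedure would run here
            used += n
        else:                      # "f": free n channels
            used = max(0, used - n)
    return False


# Step 2 without f2 in the log: every order of r2, r3, r1 still crashes.
print(all(crashes(p) for p in permutations(["r2", "r3", "r1"])))
# Step 2 with f2 logged: one possible reordering avoids the boundary.
print(crashes(["r2", "r3", "f2", "r1"]))
# Step 3: CA and DA resend, and the messages arrive in a different order.
print(crashes("r2 r3 f2 f3 r1 f1".split()))
```

The first check shows why local reordering alone cannot help when only the three requests are logged: they total 6 channels, so the last request always exceeds the capacity.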


5  Concluding  Remarks 

We have described a method of applying the log-based recovery technique, previously
developed for fail-stop hardware failures, to recovery from transient software errors. Our
five-step progressive retry approach discards partial message log information at each step in
order to introduce an increasing degree of nondeterminism for bypassing software errors.
Although not every software error can be recovered through message resending, reordering and
replaying, we have observed that progressive retry can provide an effective way of recovering
from boundary errors in long-life software systems.

The techniques described are being implemented in the fault tolerance library libft
which has been developed at AT&T Bell Laboratories [22]. Libft is a C library which
supports N-version programming, recovery blocks, exception handling, message logging, and
checkpointing and rollback recovery, and has been used by several AT&T products.
Currently, the recovery mechanism in libft provides the first step of receiver deterministic retry,
with implementation of the remaining steps in progress.